---
title: Accessing Hadoop
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

PXF is compatible with Cloudera, Hortonworks Data Platform, and generic Apache Hadoop distributions. PXF is installed with HDFS, Hive, and HBase connectors. You use these connectors to access varied formats of data from these Hadoop distributions.

## <a id="hdfs_arch"></a>Architecture

HDFS is the primary distributed storage mechanism used by Apache Hadoop. When a user or application performs a query on a PXF external table that references an HDFS file, the Greenplum Database coordinator host dispatches the query to all segment instances. Each segment instance contacts the PXF Service running on its host. When it receives the request from a segment instance, the PXF Service:

1. Allocates a worker thread to serve the request from the segment instance.
2. Invokes the HDFS Java API to request metadata information for the HDFS file from the HDFS NameNode. 

<span class="figtitleprefix">Figure: </span>PXF-to-Hadoop Architecture

![Greenplum Platform Extenstion Framework to Hadoop Architecture](graphics/pxfarch.png "Greenplum Platform Extension Framework-to-Hadoop Architecture")

A PXF worker thread works on behalf of a segment instance. A worker thread uses its Greenplum Database `gp_segment_id` and the file block information described in the metadata to assign itself a specific portion of the query data. This data may reside on one or more HDFS DataNodes.

The PXF worker thread invokes the HDFS Java API to read the data and delivers it to the segment instance. The segment instance delivers its portion of the data to the Greenplum Database coordinator host. This communication occurs across segment hosts and segment instances in parallel.


## <a id="hadoop_prereq"></a>Prerequisites

Before working with Hadoop data using PXF, ensure that:

- You have configured PXF, and PXF is running on each Greenplum Database host. See [Configuring PXF](instcfg_pxf.html) for additional information.
- You have configured the PXF Hadoop Connectors that you plan to use. Refer to [Configuring PXF Hadoop Connectors](client_instcfg.html) for instructions. If you plan to access JSON-formatted data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop distribution.
- If user impersonation is enabled (the default), ensure that you have granted read (and write as appropriate) permission to the HDFS files and directories that will be accessed as external tables in Greenplum Database to each Greenplum Database user/role name that will access the HDFS files and directories. If user impersonation is not enabled, you must grant this permission to the `gpadmin` user.
- Time is synchronized between the Greenplum Database hosts and the external Hadoop systems.


## <a id="hdfs_cmdline"></a>HDFS Shell Command Primer
Examples in the PXF Hadoop topics access files on HDFS. You can choose to access files that already exist in your HDFS cluster. Or, you can follow the steps in the examples to create new files.

A Hadoop installation includes command-line tools that interact directly with your HDFS file system. These tools support typical file system operations that include copying and listing files, changing file permissions, and so forth. You run these tools on a system with a Hadoop client installation. By default, Greenplum Database hosts do not
include a Hadoop client installation.

The HDFS file system command syntax is `hdfs dfs <options> [<file>]`. Invoked with no options, `hdfs dfs` lists the file system options supported by the tool.

The user invoking the `hdfs dfs` command must have read privileges on the HDFS data store to list and view directory and file contents, and write permission to create directories and files.

The `hdfs dfs` options used in the PXF Hadoop topics are:

| Option  | Description |
|-------|-------------------------------------|
| `-cat`    | Display file contents. |
| `-mkdir`    | Create a directory in HDFS. |
| `-put`    | Copy a file from the local file system to HDFS. |

Examples:

Create a directory in HDFS:

``` shell
$ hdfs dfs -mkdir -p /data/pxf_examples
```

Copy a text file from your local file system to HDFS:

``` shell
$ hdfs dfs -put /tmp/example.txt /data/pxf_examples/
```

Display the contents of a text file located in HDFS:

``` shell
$ hdfs dfs -cat /data/pxf_examples/example.txt
```

## <a id="hadoop_connectors"></a>Connectors, Data Formats, and Profiles

The PXF Hadoop connectors provide built-in profiles to support the following data formats:

- Text
- CSV
- Avro
- JSON
- ORC
- Parquet
- RCFile
- SequenceFile
- AvroSequenceFile

The PXF Hadoop connectors expose the following profiles to read, and in many cases write, these supported data formats:

| Data Source | Data Format | Profile Name(s) | Deprecated Profile Name | Supported Operations |
|-------------|------|---------|-----|-----|
| HDFS | delimited single line [text](hdfs_text.html#profile_text) | hdfs:text | n/a | Read, Write |
| HDFS | delimited single line comma-separated values of [text](hdfs_text.html#profile_text) | hdfs:csv | n/a | Read, Write |
| HDFS | multi-byte or multi-character delimited single line [csv](hdfs_text.html#multibyte_delim) | hdfs:csv | n/a | Read |
| HDFS | fixed width single line [text](hdfs_fixedwidth.html) | hdfs:fixedwidth | n/a | Read, Write |
| HDFS | delimited [text with quoted linefeeds](hdfs_text.html#profile_textmulti) | hdfs:text:multi | n/a | Read |
| HDFS | [Avro](hdfs_avro.html) | hdfs:avro | n/a | Read, Write |
| HDFS | [JSON](hdfs_json.html) | hdfs:json | n/a | Read, Write |
| HDFS | [ORC](hdfs_orc.html) | hdfs:orc | n/a | Read, Write |
| HDFS | [Parquet](hdfs_parquet.html) | hdfs:parquet | n/a | Read, Write |
| HDFS | AvroSequenceFile | hdfs:AvroSequenceFile | n/a | Read, Write |
| HDFS | [SequenceFile](hdfs_seqfile.html) | hdfs:SequenceFile | n/a | Read, Write |
| [Hive](hive_pxf.html) | stored as TextFile | hive, [hive:text] (hive_pxf.html#hive_text) | Hive, HiveText | Read |
| [Hive](hive_pxf.html) | stored as SequenceFile | hive | Hive | Read |
| [Hive](hive_pxf.html) | stored as RCFile | hive, [hive:rc](hive_pxf.html#hive_hiverc) | Hive, HiveRC | Read |
| [Hive](hive_pxf.html) | stored as ORC | hive, [hive:orc](hive_pxf.html#hive_orc) | Hive, HiveORC, HiveVectorizedORC | Read |
| [Hive](hive_pxf.html) | stored as Parquet | hive | Hive | Read |
| [Hive](hive_pxf.html) | stored as Avro | hive | Hive | Read |
| [HBase](hbase_pxf.html) | Any | hbase | HBase | Read |

### <a id="choose_profile"></a>Choosing the Profile

PXF provides more than one profile to access text and Parquet data on Hadoop. Here are some things to consider as you determine which profile to choose.

Choose the `hive` profile when:

- The data resides in a Hive table, and you do not know the underlying file type of the table up front.
- The data resides in a Hive table, and the Hive table is partitioned.

Choose the `hdfs:text`, `hdfs:csv` profiles when the file is text and you know the location of the file in the HDFS file system.

When accessing ORC-format data:

- Choose the `hdfs:orc` profile when the file is ORC, you know the location of the file in the HDFS file system, and the file is not managed by Hive or you do not want to use the Hive Metastore.
- Choose the `hive:orc` profile when the table is ORC and the table is managed by Hive, and the data is partitioned or the data includes complex types.

Choose the `hdfs:parquet` profile when the file is Parquet, you know the location of the file in the HDFS file system, and you want to take advantage of extended filter pushdown support for additional data types and operators.


### <a id="specify_profile"></a>Specifying the Profile

You must provide the profile name when you specify the `pxf` protocol in a `CREATE EXTERNAL TABLE` command to create a Greenplum Database external table that references a Hadoop file or directory, HBase table, or Hive table. For example, the following command creates an external table that uses the default server and specifies the profile named `hdfs:text` to access the HDFS file `/data/pxf_examples/pxf_hdfs_simple.txt`:

``` sql
CREATE EXTERNAL TABLE pxf_hdfs_text(location text, month text, num_orders int, total_sales float8)
   LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=E',');
```
